No missing data. Check data types of each variable:
The date column must be converted to a date type, and the ordinal variables (view, condition, grade) are candidates for conversion to factors.
Part 2: EDA
library(Hmisc)
library(tidyverse)
library(dplyr)
library(faraway)
library(gridExtra)
names(house)
[1] "id" "date" "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot"
[8] "floors" "waterfront" "view" "condition" "grade" "sqft_above" "sqft_basement"
[15] "yr_built" "yr_renovated" "zipcode" "lat" "long" "sqft_living15" "sqft_lot15"
[22] "num"
Before dropping the date and geotag columns, consider using them for plotting, for example transaction counts by date.
names(house)
[1] "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot" "floors" "waterfront"
[8] "view" "condition" "grade" "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
[15] "sqft_living15" "sqft_lot15"
describe(house)
house
16 Variables 21613 Observations
-----------------------------------------------------------------------------------------------------------------------------
price
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 3625 1 540182 329562 210000 245000 321950 450000 645000 887000 1160000
lowest : 75000 78000 80000 81000 82000, highest: 5350000 5570000 6890000 7060000 7700000
-----------------------------------------------------------------------------------------------------------------------------
bedrooms
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 13 0.871 3.371 0.946 2 2 3 3 4 4 5
lowest : 0 1 2 3 4, highest: 8 9 10 11 33
Value 0 1 2 3 4 5 6 7 8 9 10 11 33
Frequency 13 199 2760 9824 6882 1601 272 38 13 6 3 1 1
Proportion 0.001 0.009 0.128 0.455 0.318 0.074 0.013 0.002 0.001 0.000 0.000 0.000 0.000
-----------------------------------------------------------------------------------------------------------------------------
bathrooms
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 30 0.974 2.115 0.8444 1.00 1.00 1.75 2.25 2.50 3.00 3.50
lowest : 0.00 0.50 0.75 1.00 1.25, highest: 6.50 6.75 7.50 7.75 8.00
-----------------------------------------------------------------------------------------------------------------------------
sqft_living
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 1038 1 2080 978.4 940 1090 1427 1910 2550 3250 3760
lowest : 290 370 380 384 390, highest: 9640 9890 10040 12050 13540
-----------------------------------------------------------------------------------------------------------------------------
sqft_lot
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 9782 1 15107 17855 1800 3322 5040 7618 10688 21398 43339
lowest : 520 572 600 609 635, highest: 982998 1024068 1074218 1164794 1651359
-----------------------------------------------------------------------------------------------------------------------------
floors
n missing distinct Info Mean Gmd
21613 0 6 0.823 1.494 0.5563
lowest : 1.0 1.5 2.0 2.5 3.0, highest: 1.5 2.0 2.5 3.0 3.5
Value 1.0 1.5 2.0 2.5 3.0 3.5
Frequency 10680 1910 8241 161 613 8
Proportion 0.494 0.088 0.381 0.007 0.028 0.000
-----------------------------------------------------------------------------------------------------------------------------
waterfront
n missing distinct
21613 0 2
Value 0 1
Frequency 21450 163
Proportion 0.992 0.008
-----------------------------------------------------------------------------------------------------------------------------
view
n missing distinct
21613 0 5
lowest : 0 1 2 3 4, highest: 0 1 2 3 4
Value 0 1 2 3 4
Frequency 19489 332 963 510 319
Proportion 0.902 0.015 0.045 0.024 0.015
-----------------------------------------------------------------------------------------------------------------------------
condition
n missing distinct
21613 0 5
lowest : 1 2 3 4 5, highest: 1 2 3 4 5
Value 1 2 3 4 5
Frequency 30 172 14031 5679 1701
Proportion 0.001 0.008 0.649 0.263 0.079
-----------------------------------------------------------------------------------------------------------------------------
grade
n missing distinct
21613 0 12
lowest : 1 3 4 5 6 , highest: 9 10 11 12 13
Value 1 3 4 5 6 7 8 9 10 11 12 13
Frequency 1 3 29 242 2038 8981 6068 2615 1134 399 90 13
Proportion 0.000 0.000 0.001 0.011 0.094 0.416 0.281 0.121 0.052 0.018 0.004 0.001
-----------------------------------------------------------------------------------------------------------------------------
sqft_above
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 946 1 1788 876.2 850 970 1190 1560 2210 2950 3400
lowest : 290 370 380 384 390, highest: 7880 8020 8570 8860 9410
-----------------------------------------------------------------------------------------------------------------------------
sqft_basement
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 306 0.776 291.5 422.2 0 0 0 0 560 970 1190
lowest : 0 10 20 40 50, highest: 3260 3480 3500 4130 4820
-----------------------------------------------------------------------------------------------------------------------------
yr_built
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 116 1 1971 33.38 1915 1926 1951 1975 1997 2007 2011
lowest : 1900 1901 1902 1903 1904, highest: 2011 2012 2013 2014 2015
-----------------------------------------------------------------------------------------------------------------------------
yr_renovated
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 70 0.122 84.4 161.7 0 0 0 0 0 0 0
lowest : 0 1934 1940 1944 1945, highest: 2011 2012 2013 2014 2015
Value 0 1935 1940 1945 1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015
Frequency 20699 1 2 6 4 13 12 16 27 25 43 88 99 84 112 156 82 144
Proportion 0.958 0.000 0.000 0.000 0.000 0.001 0.001 0.001 0.001 0.001 0.002 0.004 0.005 0.004 0.005 0.007 0.004 0.007
For the frequency table, variable is rounded to the nearest 5
-----------------------------------------------------------------------------------------------------------------------------
sqft_living15
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 777 1 1987 743.2 1140 1256 1490 1840 2360 2930 3300
lowest : 399 460 620 670 690, highest: 5600 5610 5790 6110 6210
-----------------------------------------------------------------------------------------------------------------------------
sqft_lot15
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
21613 0 8689 1 12768 13404 1999 3667 5100 7620 10083 17852 37063
lowest : 651 659 660 748 750, highest: 434728 438213 560617 858132 871200
-----------------------------------------------------------------------------------------------------------------------------
ggplot(house, aes(x=sqft_living, y=price, color=waterfront))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with waterfront indicator")

ggplot(house, aes(x=sqft_living, y=price, color=view))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with view indicator")

ggplot(house, aes(x=sqft_living, y=price, color=condition))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with condition indicator")

ggplot(house, aes(x=sqft_living, y=price, color=grade))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with grade indicator")

Question: how should the indicator (ordinal) variables be handled in this case?






# Changing `view` to 0 for regular view and 1 for every other view
house$view <- factor(ifelse(house$view!=0, 1, 0))

Checking possible interactions after collapsing the categorical variables into broader classes
ggplot(house, aes(x=sqft_living, y=price, color=view))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with view indicator")

ggplot(house, aes(x=sqft_living, y=price, color=condition))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with condition indicator")

ggplot(house, aes(x=sqft_living, y=price, color=grade))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with grade indicator")

Converting the quantitative predictor floors to a factor with levels 1, 2, 3.
house$floors <- factor(ifelse(house$floors < 2, 1, ifelse(house$floors < 3, 2, 3)))
ggplot(house, aes(x=sqft_living, y=price, color=floors))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
labs(x="sqft_living",
y="price",
title="Scatter plot of price against sqft_living with floors indicator")

Computing age of the house
unique(house$bathrooms)
[1] 1.00 2.25 3.00 2.00 4.50 1.50 2.50 1.75 2.75 3.25 4.00 3.50 0.75 4.75 5.00 4.25 3.75 0.00 1.25 5.25 6.00 0.50 5.50 6.75
[25] 5.75 8.00 7.50 7.75 6.25 6.50
names(house)
[1] "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot" "floors" "waterfront"
[8] "view" "condition" "grade" "sqft_above" "sqft_basement" "sqft_living15" "sqft_lot15"
[15] "age"
Additional plots: distributions by waterfront and correlations
dp1<-ggplot(house, aes(x=sqft_living, color=waterfront))+
geom_density()+
labs(title="Sqft_living by waterfront")
dp2<-ggplot(house,aes(x=sqft_lot, color=waterfront))+
geom_density()+
labs(title="Sqft_lot by waterfront")
dp3<-ggplot(house,aes(x=sqft_above, color=waterfront))+
geom_density()+
labs(title="Sqft_above by waterfront")
dp4<-ggplot(house,aes(x=sqft_basement, color=waterfront))+
geom_density()+
labs(title="Sqft_basement by waterfront")
dp5<-ggplot(house,aes(x=sqft_living15, color=waterfront))+
geom_density()+
labs(title="Sqft_living15 by waterfront")
dp6<-ggplot(house,aes(x=sqft_lot15, color=waterfront))+
geom_density()+
labs(title="Sqft_lot15 by waterfront")
##produce the 6 density plots in a 3 by 2 grid
grid.arrange(dp1, dp2, dp3, dp4, dp5, dp6, ncol = 2, nrow = 3)
# yr_built and yr_renovated were dropped above, so use the derived age instead
dp1<-ggplot(house, aes(x=age, color=waterfront))+
geom_density()+
labs(title="Age by waterfront")
dp2<-ggplot(house,aes(x=bedrooms, color=waterfront))+
geom_density()+
labs(title="Bedrooms by waterfront")
dp3<-ggplot(house,aes(x=bathrooms, color=waterfront))+
geom_density()+
labs(title="Bathrooms by waterfront")
# floors and grade are factors at this point, so show their shares within each waterfront group
dp4<-ggplot(house,aes(x=waterfront, fill=floors))+
geom_bar(position="fill")+
labs(y="proportion", title="Floors by waterfront")
dp5<-ggplot(house,aes(x=waterfront, fill=grade))+
geom_bar(position="fill")+
labs(y="proportion", title="Grade by waterfront")
##produce the 5 plots in a 3 by 2 grid
grid.arrange(dp1, dp2, dp3, dp4, dp5, ncol = 2, nrow = 3)
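# yr_built and yr_renovated were replaced by age, and floors is now a factor,
# so keep only the columns that are still numeric
corr_vars <- c("price", "age", setdiff(quant_vars, c("yr_built", "yr_renovated", "floors")))
corr <- round(cor(house[, corr_vars]), 1)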
---
title: 'STAT 6021: Project 2'
author: "Connie Cui"
date: "11/26/2021"
output:
  html_document:
    df_print: paged
  html_notebook: default
---


Load in packages
```{r}
library(tidyverse)
library(ggplot2)
```
Import data:
```{r}
house <- read.csv("house_data.csv")
head(house)
```
Find missing data:
```{r}
# list rows of data that have missing values
house[!complete.cases(house),]
```
No missing data.
Check data types of each variable:
```{r}
str(house)
```
The date column must be converted to a date type, and the ordinal variables (view, condition, grade) are candidates for conversion to factors.
```{r}
house$date = substr(house$date,1,nchar(house$date)-7)
head(house)
```
Convert the date variable to the Date type:
```{r}
house$date <- as.Date(house$date, "%Y%m%d")
head(house)
```
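
The two steps above (trimming the timestamp suffix, then parsing) can be checked on a couple of sample strings. This is only a sketch: the values in `raw_dates` are hypothetical and assume the raw column looks like `YYYYMMDDT000000`, which is what removing the last 7 characters suggests.

```r
# Hypothetical sample values in the assumed "YYYYMMDDT000000" format
raw_dates <- c("20141013T000000", "20150225T000000")

# Drop the trailing 7 characters ("T000000"), leaving "YYYYMMDD"
trimmed <- substr(raw_dates, 1, nchar(raw_dates) - 7)

# Parse the remaining digits as a Date
parsed <- as.Date(trimmed, format = "%Y%m%d")
print(parsed)

# as.Date() ignores trailing characters once the format has been matched,
# so parsing the untrimmed strings gives the same dates
parsed_direct <- as.Date(raw_dates, format = "%Y%m%d")
print(parsed_direct)
```

If the real column deviates from this format, `as.Date()` returns `NA` for the affected rows, so a quick `sum(is.na(house$date))` after the conversion is a cheap sanity check.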
Turning view, condition, and grade into ordered factors, and waterfront into a factor:
```{r}
house$view <- factor(house$view, ordered = TRUE, levels = c(0, 1, 2, 3, 4))
house$condition <- factor(house$condition, ordered = TRUE, levels = c(1, 2, 3, 4, 5))
house$grade <- factor(house$grade, ordered = TRUE, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
# waterfront is a binary indicator, so an unordered factor is enough
house$waterfront <- factor(house$waterfront, levels = c(0, 1))
```
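
As a small illustration of what an ordered factor provides, the sketch below uses a toy vector `cond` (hypothetical values, not taken from the data set). It shows that comparisons follow the level order and that declared levels with no observations are still kept.

```r
# Toy condition scores (hypothetical values)
cond <- c(3, 5, 1, 4, 3)

# Same construction as above: explicit levels, ordered = TRUE
cond_ord <- factor(cond, ordered = TRUE, levels = 1:5)

# Ordered factors support comparisons that follow the level order
is_good <- cond_ord >= "4"
print(is_good)

# Levels that never occur (here: 2) are kept and appear with a zero count
level_counts <- table(cond_ord)
print(level_counts)
```

Unused levels are harmless for plotting and modelling, but `droplevels()` removes them if a cleaner table is wanted.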



## Part 2: EDA

```{r}
#install.packages("ggcorrplot")
#install.packages("miscset")
library(miscset)
library(Hmisc)
library(tidyverse)
library(dplyr)
library(faraway)
library(gridExtra)
```

```{r}
names(house)
```

#### Before dropping the date and geotag columns, consider using them for plotting, for example transaction counts by date.
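
One possible way to get transaction counts by date is to label each sale with its year and month and tabulate the labels. The sketch below uses a hypothetical vector `sale_dates` in place of `house$date`, and base R plotting to stay self-contained.

```r
# Hypothetical sale dates standing in for house$date after conversion to Date
sale_dates <- as.Date(c("2014-05-02", "2014-05-20", "2014-06-11",
                        "2014-06-30", "2014-06-15", "2015-01-07"))

# Label each sale with its year-month, then count sales per label
sale_month <- format(sale_dates, "%Y-%m")
monthly_counts <- as.data.frame(table(month = sale_month))
print(monthly_counts)

# Quick bar chart of the monthly counts
barplot(monthly_counts$Freq,
        names.arg = as.character(monthly_counts$month),
        xlab = "Month", ylab = "Transactions")
```

On the real data this would need to run before the chunk that drops `date`.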


```{r}
house <- subset(house, select=-c(id,num, date, zipcode, lat, long))
names(house)
```

```{r}
describe(house)
```


```{r}
ggplot(house, aes(x=sqft_living, y=price, color=waterfront))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with waterfront indicator")
```


```{r}
ggplot(house, aes(x=sqft_living, y=price, color=view))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with view indicator")
```


```{r}
ggplot(house, aes(x=sqft_living, y=price, color=condition))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with condition indicator")
```


```{r}
ggplot(house, aes(x=sqft_living, y=price, color=grade))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with grade indicator")
```

#### Question: how should the indicator (ordinal) variables be handled in this case?


```{r}
quant_vars = c("yr_built", "yr_renovated",
               "floors", "bedrooms", "bathrooms", 
               "sqft_living", "sqft_lot", "sqft_above", "sqft_basement", 
               "sqft_living15", "sqft_lot15")

cat_vars = c("waterfront", "view", "condition", "grade")

library(Hmisc)
hist.data.frame(house[,quant_vars])
```

```{r}
ggplotGrid(ncol = 2,
  lapply(c("view", "waterfront", "condition", "grade"),
    function(col) {
        ggplot(house, aes_string(col)) + geom_bar() + coord_flip()
    }))
```

```{r}
ggplot(house, aes(x=waterfront, y=price))+
geom_boxplot()+
labs(x="waterfront", y="price", title="Price by waterfront")
```




```{r}
ggplot(house, aes(x=view, y=price))+
geom_boxplot()+
labs(x="view", y="price", title="Price by view")
```


```{r}
ggplot(house, aes(x=condition, y=price))+
geom_boxplot()+
labs(x="condition", y="price", title="Price by condition")
```


```{r}
ggplot(house, aes(x=grade, y=price))+
geom_boxplot()+
labs(x="grade", y="price", title="Price by grade")
```



```{r}
head(house)
```

```{r}
# Changing `view` to 0 for regular view (level 0) and 1 for every other view
house$view <- factor(ifelse(house$view!=0, 1, 0))
# Changing `condition` to 0 for levels 1 to 3 and 1 otherwise
house$condition <- factor(ifelse(house$condition %in% 1:3, 0, 1))
# Changing `grade` to 0 for everything below 7 and 1 otherwise
house$grade <- factor(ifelse(house$grade %in% 1:6, 0, 1))
```
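
A recoding like the one above is easy to get subtly wrong, so it can help to cross-tabulate the original levels against the collapsed ones. The sketch below does this on a toy `grade` vector (hypothetical values) and assumes the intended cutoff is "below 7 versus 7 and above".

```r
# Toy grade values (hypothetical), stored as an ordered factor like house$grade
grade <- factor(c(5, 6, 7, 8, 6, 10), ordered = TRUE, levels = 1:13)

# Collapse to a binary factor: 0 for grades 1 to 6, 1 for grades 7 and above
grade_bin <- factor(ifelse(grade %in% 1:6, 0, 1))

# Cross-tabulate original against recoded values; every original level
# should have counts in exactly one column
recode_check <- table(original = droplevels(grade), recoded = grade_bin)
print(recode_check)
```

On the real data, the same check would have to be run on a copy of the original column, since the chunk above overwrites `house$grade` in place.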


```{r}
ggplotGrid(ncol = 2,
  lapply(c("view", "waterfront", "condition", "grade"),
    function(col) {
        ggplot(house, aes_string(col)) + geom_bar() + coord_flip()
    }))
```

#### Checking possible interactions after collapsing the categorical variables into broader classes

```{r}
ggplot(house, aes(x=sqft_living, y=price, color=view))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with view indicator")
```
```{r}
ggplot(house, aes(x=sqft_living, y=price, color=condition))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with condition indicator")
```


```{r}
ggplot(house, aes(x=sqft_living, y=price, color=grade))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with grade indicator")
```



#### Converting the quantitative predictor floors to a factor with levels 1, 2, 3.


```{r}
house$floors <- factor(ifelse(house$floors < 2, 1, ifelse(house$floors < 3, 2, 3)))
```
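
An alternative to nested `ifelse()` calls for this kind of binning is `cut()`. The sketch below is not the notebook's method; it uses a toy `floors` vector covering the six values seen in the data and checks that both approaches assign the same bins.

```r
# The six floor counts that occur in the data
floors <- c(1, 1.5, 2, 2.5, 3, 3.5)

# Nested ifelse: below 2 -> 1, from 2 to below 3 -> 2, 3 and above -> 3
floors_nested <- factor(ifelse(floors < 2, 1, ifelse(floors < 3, 2, 3)))

# Same bins with cut(): right = FALSE makes the intervals [a, b)
floors_cut <- cut(floors, breaks = c(-Inf, 2, 3, Inf),
                  right = FALSE, labels = c("1", "2", "3"))

print(data.frame(floors, floors_nested, floors_cut))
```

With `cut()`, the break points sit in one vector, which tends to be easier to adjust than a chain of conditions.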



```{r}
ggplot(house, aes(x=sqft_living, y=price, color=floors))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with floors indicator")
```


#### Computing age of the house

```{r}
# Age in 2021, counted from the renovation year if the house was renovated
# (yr_renovated is 0 otherwise) and from the build year if not
house$age = 2021 - pmax(house$yr_built, house$yr_renovated)
head(house)
```
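
The age computation above measures years since the most recent of build and renovation, relative to 2021. A short worked example on three hypothetical houses shows the behaviour when `yr_renovated` is 0 (never renovated), and checks that the nested `ifelse()` form and a `pmax()` form agree.

```r
# Three hypothetical houses: never renovated, renovated in 1991, new build
yr_built     <- c(1955, 1951, 2014)
yr_renovated <- c(0,    1991, 0)

# Nested form: pick whichever of the two differences is smaller
age_nested <- ifelse(2021 - yr_renovated >= 2021 - yr_built,
                     2021 - yr_built,
                     2021 - yr_renovated)

# Equivalent form: subtract the later of the two years from 2021
age_pmax <- 2021 - pmax(yr_built, yr_renovated)

print(age_nested)  # 66 30 7
print(age_pmax)    # 66 30 7
```

Because a missing renovation is coded as 0, it never wins the `pmax()`, so unrenovated houses fall back to the build year.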

```{r}
hist(house$bathrooms)
unique(house$bathrooms)
```

```{r}
house <- subset(house, select=-c(yr_renovated, yr_built))
names(house)
```


#### Additional plots: distributions by waterfront and correlations


```{r}
dp1<-ggplot(house, aes(x=sqft_living, color=waterfront))+
geom_density()+
labs(title="Sqft_living by waterfront")


dp2<-ggplot(house,aes(x=sqft_lot, color=waterfront))+
geom_density()+
labs(title="Sqft_lot by waterfront")


dp3<-ggplot(house,aes(x=sqft_above, color=waterfront))+
geom_density()+
labs(title="Sqft_above by waterfront")


dp4<-ggplot(house,aes(x=sqft_basement, color=waterfront))+
geom_density()+
labs(title="Sqft_basement by waterfront")

dp5<-ggplot(house,aes(x=sqft_living15, color=waterfront))+
geom_density()+
labs(title="Sqft_living15 by waterfront")


dp6<-ggplot(house,aes(x=sqft_lot15, color=waterfront))+
geom_density()+
labs(title="Sqft_lot15 by waterfront")


##produce the 6 density plots in a 3 by 2 grid
grid.arrange(dp1, dp2, dp3, dp4, dp5, dp6, ncol = 2, nrow = 3)
```

               


```{r}
# yr_built and yr_renovated were dropped above, so use the derived age instead
dp1<-ggplot(house, aes(x=age, color=waterfront))+
geom_density()+
labs(title="Age by waterfront")


dp2<-ggplot(house,aes(x=bedrooms, color=waterfront))+
geom_density()+
labs(title="Bedrooms by waterfront")


dp3<-ggplot(house,aes(x=bathrooms, color=waterfront))+
geom_density()+
labs(title="Bathrooms by waterfront")


# floors and grade are factors at this point, so show their shares within each waterfront group
dp4<-ggplot(house,aes(x=waterfront, fill=floors))+
geom_bar(position="fill")+
labs(y="proportion", title="Floors by waterfront")

dp5<-ggplot(house,aes(x=waterfront, fill=grade))+
geom_bar(position="fill")+
labs(y="proportion", title="Grade by waterfront")


##produce the 5 plots in a 3 by 2 grid
grid.arrange(dp1, dp2, dp3, dp4, dp5, ncol = 2, nrow = 3)
```


```{r}
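# yr_built and yr_renovated were replaced by age, and floors is now a factor,
# so keep only the columns that are still numeric
corr_vars <- c("price", "age", setdiff(quant_vars, c("yr_built", "yr_renovated", "floors")))
corr <- round(cor(house[, corr_vars]), 1)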
#head(corr[, 1:6])
library(ggcorrplot)
ggcorrplot(corr, 
           method = "circle", 
           lab = TRUE,
           type = "lower", 
           outline.color = "white", 
           ggtheme = ggplot2::theme_gray,
           colors = c("#6D9EC1", "white", "#E46726"))
```





## TODO: interpreting models https://cran.r-project.org/web/packages/jtools/vignettes/summ.html

```{r}
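# sqft_basement is left out, likely because sqft_living appears to equal
# sqft_above + sqft_basement, which would make the three perfectly collinear
fit <- lm(price ~ . - sqft_basement, data = house)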
summary(fit)
```
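
The full model above leaves out `sqft_basement`. The notebook does not say why, but a plausible reason is that total living area is the sum of above-ground and basement area, which would make the three columns perfectly collinear. The sketch below illustrates that effect on simulated data; the variables `living`, `above`, `basement`, and the price formula are invented for the illustration.

```r
set.seed(1)
n <- 50

# Simulated areas where living is exactly above + basement
above    <- round(runif(n, 800, 3000))
basement <- round(runif(n, 0, 1000))
living   <- above + basement
price    <- 100 * living + 50 * basement + rnorm(n, sd = 10000)

# With all three predictors, lm() cannot estimate the last redundant term
# and reports its coefficient as NA
full <- lm(price ~ living + above + basement)
print(coef(full))

# Dropping the redundant predictor gives the same fitted values
dropped <- lm(price ~ living + above)
print(all.equal(fitted(full), fitted(dropped)))
```

Removing one of the three columns up front, as the formula above does, avoids the `NA` row in `summary()` without changing the fit.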

```{r}
plot(fit)
```





```{r}

reduced <- lm(log(price) ~ sqft_living+bedrooms+bathrooms+floors+waterfront+age+sqft_living15+sqft_lot15, 
              data = house)
summary(reduced)
```
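
Since the reduced model uses `log(price)` as the response, its coefficients act multiplicatively on price: a one-unit increase in a predictor changes the expected price by roughly `100 * (exp(beta) - 1)` percent, holding the others fixed. The sketch below demonstrates the conversion on simulated data; the true coefficients (0.0004 and 0.5) and the variable names are invented for the illustration.

```r
set.seed(42)
n <- 500

# Simulated predictors: living area and a rare binary waterfront flag
sqft       <- runif(n, 500, 4000)
waterfront <- rbinom(n, 1, 0.1)

# Simulated prices that are linear on the log scale
log_price <- 12 + 0.0004 * sqft + 0.5 * waterfront + rnorm(n, sd = 0.2)
price     <- exp(log_price)

fit_log <- lm(log(price) ~ sqft + waterfront)

# Convert each slope to an approximate percentage change in price
pct_change <- 100 * (exp(coef(fit_log)[-1]) - 1)
print(round(pct_change, 2))
```

For factor predictors such as `waterfront` in the real model, the same conversion gives the percentage difference relative to the reference level.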


```{r}
plot(reduced)
```



```{r}
tiny <- lm(price ~ sqft_living15+bathrooms+waterfront, 
              data = house)
summary(tiny)
```



```{r}
plot(tiny)
```



```{r}
# Predictor on the x-axis, response (price) on the y-axis
plot(house$sqft_living, house$price)
```

```{r}
# Predictor on the x-axis, response (price) on the y-axis
plot(house$bathrooms, house$price)
```



```{r}
# Run once in the console rather than on every knit
#install.packages("plotly")
```




```{r}

library(plotly)
fig = list(
  data = list(
    list(
      x = house$floors,
      y = house$bedrooms,
      type = 'bar'
    )
  ),
  layout = list(
    title = 'Bedrooms by number of floors',
    plot_bgcolor='#e5ecf6', 
         xaxis = list( 
           zerolinecolor = '#ffffff', 
           zerolinewidth = 2, 
           gridcolor = '#ffffff'), 
         yaxis = list( 
           zerolinecolor = '#ffffff', 
           zerolinewidth = 2, 
           gridcolor = '#ffffff')
  )
)
# To display the figure defined by this list, use the plotly_build function
plotly_build(fig)
```




